Dr. M. Baron, Statistical Machine Learning class, STAT-427/627

CROSS-VALIDATION
1. Validation Set Approach
> library(ISLR2)
> attach(Auto); n = length(mpg);
> n
[1] 392
> Z = sample( n, 250 )    # Random subsample of size 250 (does not have to be n/2);
                          # works similarly to generating a Bernoulli variable
> reg.fit = lm( mpg ~ weight + horsepower + acceleration, subset=Z ) # Fit using training data
> mpg_predicted = predict( reg.fit, newdata=Auto[-Z, ] )    # Use this model to predict the testing data [-Z]
> plot(mpg[-Z],mpg_predicted) # We can see a nonlinear component
> abline(0,1) # Compare with the line y=x.
> mean( (mpg[-Z] - mpg_predicted)^2 )    # Estimate the mean-squared error MSE
[1] 18.58128
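The estimate above depends on which 250 observations happen to fall into the training set Z. A minimal sketch (assuming the ISLR2 package is installed): repeat the random split several times and look at the spread of the resulting MSE estimates.

```r
library(ISLR2)                       # for the Auto data
n = nrow(Auto)
set.seed(1)                          # make the random splits reproducible
MSE = replicate(20, {
  Z = sample(n, 250)                 # a fresh random training set each time
  fit = lm(mpg ~ weight + horsepower + acceleration, data = Auto, subset = Z)
  pred = predict(fit, newdata = Auto[-Z, ])
  mean((Auto$mpg[-Z] - pred)^2)      # validation-set MSE for this split
})
summary(MSE)                         # the estimates vary noticeably from split to split
```

This variability is one motivation for LOOCV and K-fold cross-validation, which average over many splits instead of relying on one.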
2. Jackknife (Leave-One-Out Cross-Validation, LOOCV)
"Manually":
> Yhat = numeric(n)
> for (i in 1:n){
+   reg = lm( mpg ~ weight + horsepower + acceleration, data=Auto[-i,] )
+   Yhat[i] = predict( reg, newdata=Auto[i,] )
+ }
> plot(mpg,Yhat)
> mean((mpg-Yhat)^2)
[1] 18.25595
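For least-squares fits, the n refits in the loop above are not actually necessary: a known identity expresses the LOOCV error through the leverage values (hat values) of a single fit. A sketch, assuming the ISLR2 package is loaded:

```r
library(ISLR2)                       # for the Auto data
fit = lm(mpg ~ weight + horsepower + acceleration, data = Auto)
h = hatvalues(fit)                   # leverage of each observation
# LOOCV MSE for linear regression: mean of squared PRESS residuals
cv.shortcut = mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)
cv.shortcut                          # equals the looped LOOCV estimate above (about 18.256)
```

This shortcut is exact for linear models with squared-error loss, which is why LOOCV is cheap in that setting.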
Using package "boot":
> install.packages("boot")
> library(boot)
> glm.fit = glm(mpg ~ weight + horsepower + acceleration)  # Default "family" is Normal (Gaussian), so this GLM model
> cv.error = cv.glm( Auto, glm.fit )                       # is the same as LM, standard linear regression.
> names(cv.error)                        # This cross-validation tool has several outputs
[1] "call"  "K"     "delta" "seed"       # We are interested in "delta"
> cv.error$delta
[1] 18.25595 18.25542
# Delta consists of 2 numbers – the estimated prediction error and its version adjusted for the sample size lost due to cross-validation.
# Example – what power of "horsepower" is optimal for this prediction?
> cv.error = rep(0,10)                   # Initiate a vector of estimated errors
> for (p in 1:10){                       # Fit polynomial regression models
+   glm.fit = glm( mpg ~ weight + poly(horsepower,p) + acceleration )   # with power p=1..10
+   cv.error[p] = cv.glm( Auto, glm.fit )$delta[1] }                    # Save prediction errors
> cv.error # Look at the results
[1] 18.25595 15.90163 15.72995 15.86879 15.74517 15.74989 15.70073 15.79314 15.95933 16.38301
# Although p=7 yields the lowest estimated prediction error, after p=2 the improvement is very little.
> plot(cv.error)
> lines(cv.error)
3. K-Fold Cross-Validation
# We can specify K within cv.glm. If K is omitted, the default is K = n, the number of observations, which is LOOCV.
> cv.error = rep(0,10)
> for (p in 1:10){
+   glm.fit = glm( mpg ~ weight + poly(horsepower,p) + acceleration )
+   cv.error[p] = cv.glm( Auto, glm.fit, K=30 )$delta[1] }
> cv.error
[1] 18.22699 15.87030 15.73565 15.85911 15.80686 15.71585 16.09625 15.67797 15.82097 16.48196
> which.min(cv.error)
[1] 6
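Unlike LOOCV, K-fold cross-validation depends on the random assignment of observations to folds, so rerunning it gives slightly different error estimates. A sketch (assuming the ISLR2 and boot packages are loaded): repeat 10-fold CV several times for the p=2 model and compare the estimates.

```r
library(ISLR2); library(boot)        # Auto data and cv.glm
glm.fit = glm(mpg ~ weight + poly(horsepower, 2) + acceleration, data = Auto)
set.seed(627)                        # make the fold assignments reproducible
err = replicate(5, cv.glm(Auto, glm.fit, K = 10)$delta[1])
err                                  # several slightly different estimates of the same prediction error
```

This is why the minimizing p can change from run to run (here p=6, versus p=7 under LOOCV); differences of a few hundredths in cv.error are within the CV noise.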
4. Cross-Validation in Classification Problems. Loss Function.
# The command cv.glm, as we used it above, calculates MSE, the mean squared error, and calls it "delta".
# In classification problems, MSE can be used to measure the distance between the response variable Y
# and the predicted probability p. For this, Y has to be 0 or 1, and p should be the probability of Y=1.
# However, the correct classification rate and the error (misclassification) rate are more standard
# measures of classification accuracy. We can force cv.glm to return these measures by introducing a
# suitable loss function. For example, we'll be predicting whether a student is depressed or not.
# We define a loss function L(Y,p), a function of the true response Y and the predicted probability p.
# The loss = 1 if the predicted response differs from the actual response and 0 otherwise. Suppose the
# classification threshold is 0.5.
> loss = function(Y,p){
+   return( mean( (Y==1 & p < 0.5) | (Y==0 & p >= 0.5) ) ) }
> loss(1,0.3)
[1] 1
> loss(c(1,1),c(0.3,0.7))
[1] 0.5
# Now we attach the Depression data, skip missing values, fit the logistic regression model, and
# estimate the error classification rate by LOOCV.
> Depr = read.csv(url("http://fs2.american.edu/~baron/627/R/depression_data.csv"))
> D = na.omit(Depr)
> attach(D)
> library(boot)
> lreg = glm(Diagnosis ~ Gender + Guardian_status + Cohesion_score, family="binomial")
> cv = cv.glm( D, lreg, loss )
> cv$delta[1]
[1] 0.1615721
# Is Guardian_status significant? Let's compare the error rates with and without the variable "Guardian_status".
> lreg = glm(Diagnosis ~ Gender + Cohesion_score,
family="binomial")
> cv.glm( D, lreg, loss )$delta[1]
[1] 0.1572052
# The error classification rate is lower without "Guardian_status".